
    Self-organizing strategies for a column-store database

    Column-store database systems open new vistas for improved maintenance through self-organization. Individual columns are the focal point, which simplifies balancing conflicting requirements. This work presents two workload-driven self-organizing techniques for a column-store: adaptive segmentation and adaptive replication. Adaptive segmentation splits a column into non-overlapping segments based on the actual query load. Likewise, adaptive replication creates segment replicas. The strategies can support different application requirements by trading off the reorganization overhead for storage cost. Both techniques can significantly improve system performance, as demonstrated in an evaluation of different scenarios.
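
One way to picture adaptive segmentation is a column that starts as a single segment and is split at the bounds of each range query it serves. The sketch below is only an illustration under that assumption; the class `SegmentedColumn` and its methods are hypothetical names, not the paper's implementation, and replication is not modelled.

```python
from bisect import bisect_right


class SegmentedColumn:
    """Toy column partitioned into non-overlapping value ranges (hypothetical)."""

    def __init__(self, values):
        # bounds holds the sorted split points; segment i stores the values
        # in [bounds[i-1], bounds[i]), with open ends for the first and last.
        self.bounds = []
        self.segments = [list(values)]

    def _split_at(self, point):
        """Make `point` a segment boundary by splitting the segment holding it."""
        if point in self.bounds:
            return
        i = bisect_right(self.bounds, point)
        seg = self.segments[i]
        left = [v for v in seg if v < point]
        right = [v for v in seg if v >= point]
        self.segments[i:i + 1] = [left, right]
        self.bounds.insert(i, point)

    def range_query(self, lo, hi):
        """Return all values v with lo <= v < hi, reorganizing as a side effect."""
        if lo >= hi:
            return []
        self._split_at(lo)
        self._split_at(hi)
        first = bisect_right(self.bounds, lo)  # segment that starts at lo
        end = bisect_right(self.bounds, hi)    # one past the segment ending at hi
        return [v for seg in self.segments[first:end] for v in seg]


col = SegmentedColumn([5, 1, 9, 3, 7, 2, 8])
print(sorted(col.range_query(3, 8)))
print(len(col.segments))
```

After the first query the column holds three segments, so a later query inside the same range only touches the middle one. Every split costs reorganization work, which is the overhead side of the trade-off the abstract mentions.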

    Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories

    In this short paper, we outline the Data Vault, a database-attached external file repository. It provides a true symbiosis between a DBMS and existing file-based repositories. Data is kept in its original format, while scalable processing functionality is provided through the DBMS facilities. In particular, it provides transparent access to all data kept in the repository through an (array-based) query language using the file-type-specific scientific libraries. The design space for data vaults is characterized by requirements coming from various fields. We present a reference architecture for their realization in (commercial) DBMSs and a concrete implementation in MonetDB for remote sensing data, geared at content-based image retrieval.

    Just-in-time Data Distribution for Analytical Query Processing

    Distributed processing commonly requires data spread across machines using a priori static or hash-based data allocation. In this paper, we explore an alternative approach that starts from a master node in control of the complete database, and a variable number of worker nodes for delegated query processing. Data is shipped just-in-time to the worker nodes using a need-to-know policy, and is reused, if possible, in subsequent queries. A bidding mechanism among the workers yields a schedule with the most efficient reuse of previously shipped data, minimizing the data transfer costs. Just-in-time data shipment allows our system to benefit from locally available idle resources to boost overall performance. The system is maintenance-free and allocation is fully transparent to users. Our experiments show that the proposed adaptive distributed architecture is a viable and flexible alternative for small-scale MapReduce-type settings.
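
The bidding idea can be sketched as follows, assuming each worker bids the number of bytes the master would still have to ship to it, and the lowest bid wins. The names `Worker`, `bid` and `schedule`, the column sizes, and the bid formula are all assumptions made for illustration; the paper's actual bidding and scheduling rules may differ.

```python
class Worker:
    """Hypothetical worker node that remembers which columns it already holds."""

    def __init__(self, name):
        self.name = name
        self.cache = set()

    def bid(self, needed, sizes):
        # Assumed bid: bytes that would have to be shipped to this worker.
        return sum(sizes[c] for c in needed - self.cache)


def schedule(query_columns, workers, sizes):
    """Pick the cheapest worker, ship only what it lacks, return (name, bytes)."""
    needed = set(query_columns)
    winner = min(workers, key=lambda w: w.bid(needed, sizes))
    missing = needed - winner.cache      # need-to-know: ship only missing data
    winner.cache |= needed               # shipped data stays for later reuse
    return winner.name, sum(sizes[c] for c in missing)


sizes = {"a": 100, "b": 50, "c": 10}     # made-up column sizes in bytes
workers = [Worker("w1"), Worker("w2")]
print(schedule(["a", "b"], workers, sizes))
print(schedule(["a", "c"], workers, sizes))
```

This sketch ignores worker load, so a worker with a warm cache keeps winning. A real scheduler would presumably weigh idle capacity against transfer cost.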

    SciQL, Bridging the Gap between Science and Relational DBMS

    Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we propose SciQL (pronounced as ‘cycle’), the first SQL-based query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array-, set- and sequence-interpretations. A key innovation is the extension of value-based grouping of SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between element positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. This paper describes the main language features of SciQL and illustrates them using time-series concepts.
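
To make structural grouping concrete, the sketch below forms one group per position of a time series from the elements at fixed relative offsets and averages each group. It is plain Python, not SciQL syntax; the function name `structural_group_avg` and the clipping of groups at the series edges are assumptions made for this example.

```python
def structural_group_avg(series, offsets=(-1, 0, 1)):
    """Average each element with its neighbours at the given relative positions.

    Groups are defined by element positions (structure), not by equal values
    as in SQL's GROUP BY. Groups at the edges are clipped to the series.
    """
    result = []
    for i in range(len(series)):
        group = [series[i + d] for d in offsets if 0 <= i + d < len(series)]
        result.append(sum(group) / len(group))
    return result


print(structural_group_avg([1.0, 2.0, 3.0, 4.0]))
```

With the default offsets this is a three-point moving average, the kind of window computation that structural grouping generalises.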

    SciQL, A query language for science applications

    Scientific applications are still poorly served by contemporary relational database systems. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. The time has come to rectify this with SciQL, an SQL-based query language for science applications with arrays as first-class citizens. It provides a seamless symbiosis of array-, set-, and sequence-interpretation using a clear separation of the mathematical object from its underlying storage representation. The language extends value-based grouping in SQL with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between its index attributes. It leads to a generalization of window-based query processing. The SciQL architecture benefits from a column-store system with an adaptive storage scheme, including keeping multiple representations around for reduced impedance mismatch. This paper focuses on the language features, their architectural consequences and extensive examples of their intended use.

    Data Vaults: Database Technology for Scientific File Repositories

    Current data-management systems and analysis tools fail to meet scientists’ data-intensive needs. A "data vault" approach lets researchers effectively and efficiently explore and analyze information.

    An architecture for recycling intermediates in a column-store

    Automatically recycling (intermediate) results is a grand challenge for state-of-the-art databases to improve both query response time and throughput. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline, avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized results in each step of the query plan. To avoid resource contention, these intermediates are evicted as soon as possible. In this paper, we study an architecture that harvests the by-products of the operator-at-a-time paradigm in a column-store system using a lightweight mechanism, the recycler. The key challenge then becomes the selection of the policies to admit intermediates to the resource pool, their retention period, and the eviction strategy when facing resource limitations. The proposed recycling architecture has been implemented in an open-source system. An experimental analysis against the TPC-H ad-hoc decision support benchmark and a complex, real-world application (SkyServer) demonstrates its effectiveness in terms of self-organizing behavior and its significant performance gains. The results indicate the potential of recycling intermediates and chart a route for further development of database kernels.
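
The admission and eviction questions can be illustrated with a small pool that keeps materialized operator results keyed by operator name and arguments. This sketch assumes admit-everything-that-fits and least-recently-used eviction, which are stand-in policies chosen for brevity and not necessarily those studied in the paper. `Recycler`, `select_range` and `DATA` are hypothetical names.

```python
from collections import OrderedDict


class Recycler:
    """Toy pool of materialized intermediates with LRU eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity   # budget, counted in cached result elements
        self.pool = OrderedDict()  # (operator, args) -> result, oldest first
        self.used = 0
        self.hits = 0

    def run(self, op_name, func, *args):
        key = (op_name, args)
        if key in self.pool:
            self.pool.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self.pool[key]
        result = func(*args)
        self._admit(key, result)
        return result

    def _admit(self, key, result):
        size = len(result)
        if size > self.capacity:        # too large to ever fit: do not admit
            return
        while self.used + size > self.capacity:
            _, evicted = self.pool.popitem(last=False)
            self.used -= len(evicted)
        self.pool[key] = result
        self.used += size


DATA = {"price": (5, 12, 7, 30, 18)}    # made-up column


def select_range(column, lo, hi):
    """Return positions whose value lies in [lo, hi), as an operator would."""
    return tuple(i for i, v in enumerate(DATA[column]) if lo <= v < hi)


recycler = Recycler(capacity=4)
recycler.run("select", select_range, "price", 5, 20)
recycler.run("select", select_range, "price", 5, 20)  # served from the pool
print(recycler.hits)
```

Exact-match keys are the simplest form of reuse; detecting that one cached selection subsumes another would need more than a dictionary lookup.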

    Lazy ETL in Action: ETL Technology Dates Scientific Data

    Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, the initial loading is a time- and resource-consuming operation. It might not be entirely necessary, e.g., if the user is interested in only a subset of the data. We propose to demonstrate Lazy ETL, a technique to lower costs for initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse. For a query, only the required data items are extracted, transformed, and loaded transparently on-the-fly. The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing, without waiting for long initial loading. The audience fires analytical queries to observe the internal mechanisms and modifications that realize each of the steps: lazy extraction, transformation, and loading.
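
The lazy principle can be sketched with a warehouse that knows which source files exist but extracts, transforms, and loads a file only when a query first touches it. Everything here is a simplified assumption: the class `LazyWarehouse`, the comma-separated text standing in for seismic files, and the `max_amplitude` query are hypothetical and do not reflect the demo's actual implementation.

```python
class LazyWarehouse:
    """Toy warehouse that runs ETL per source file on first use (hypothetical)."""

    def __init__(self, repository):
        self.repository = repository  # file name -> raw text (stand-in for files)
        self.loaded = {}              # file name -> transformed values

    def _etl(self, name):
        if name not in self.loaded:
            raw = self.repository[name]                 # extract
            values = [float(x) for x in raw.split(",")] # transform
            self.loaded[name] = values                  # load
        return self.loaded[name]

    def max_amplitude(self, names):
        """Query: largest absolute sample over the requested files only."""
        return max(max(abs(v) for v in self._etl(n)) for n in names)


repo = {"st1": "0.1,-0.7,0.3", "st2": "1.5,-2.0", "st3": "9.0"}
warehouse = LazyWarehouse(repo)
print(warehouse.max_amplitude(["st1"]))
print(sorted(warehouse.loaded))
```

Only `st1` is loaded after the query. The other files stay untouched until some later query asks for them, so the warehouse is usable without an initial bulk load.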